In the last session, we worked with some pre-trained word embedding models and found out that they still come with a lot of problems. The next breakthrough in the development of embeddings came with transformer models (Vaswani et al. 2017). They have several advantages, which lead to models that:
Can take larger contexts into account when training (remember the limited window we used before)
Can be trained much more efficiently and can hence take in even more texts
Can have several embeddings for each word depending on the context, finally moving away from the bag-of-words paradigm
Can be fine-tuned on new data which contains different vocabulary
However, compared to other approaches like Naive Bayes or SVM algorithms, this technology is still relatively young. The step that happened about 10-15 years ago, when many of these methods were implemented in R, has not really happened yet for transformers. Meanwhile, the models also only run well on new, powerful hardware, since the required matrix computations are slow on CPUs and in practice need a GPU.
So this session is currently more a preview than an actual hands-on tutorial.
R wrappers for large language models
Another problem with LLMs is that they are predominantly controlled from Python. R has excellent wrappers for languages like C, C++, Rust or Java, but the interface to Python still falls a little behind in terms of comfort of use. Packages like spacyr and grafzahl try to employ Python anyway through the reticulate compatibility layer. (They do still have some issues to figure out.)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.2 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(rsample) # provides initial_split(), training() and testing()

imdb <- readRDS("data/imdb.rds")
set.seed(1)
split <- initial_split(
  data = imdb,
  prop = 3 / 4, # the prop is the default, I just wanted to make that visible
  strata = label # this makes sure the prevalence of labels is still the same afterwards
)
imdb_train <- training(split)
imdb_test <- testing(split)
If you’re here, you probably already know R so why re-learn things from scratch?
R is a programming language specifically for statistics with some great built-in functionality that you would miss in Python.
R has absolutely outstanding packages for data science with no drop-in replacement in Python (e.g., ggplot2, dplyr, tidytext).
Why not just stick with R then?
Newer models and methods in machine learning are often Python only (as advancements are made by big companies who rely on Python)
You might want to collaborate with someone who uses Python and need to run their code
Learning a new (programming) language is always good to extend your skills (also in the language(s) you already know)
Getting started
We start by installing the necessary Python packages, for which you should use a virtual environment (so we set that one up first).
Create a Virtual Environment
Before you load reticulate for the first time, you need to create a virtual environment. This is a folder in your project directory with a link to Python and the packages you want to use in this project. Why?
Packages (or their dependencies) on the Python Package Index can be incompatible with each other – meaning you can break things by updating.
Your operating system might keep older versions of some packages around, which means you could break your OS by an accidental update!
This also adds to projects being reproducible on other systems, as you keep track of the specific version of each package used in your project (you could do this in R with the renv package).
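As a small illustration of keeping track of package versions, the following sketch uses Python's standard library to report which version of a package is installed in the active environment. The helper name `freeze` and the made-up package name are assumptions for illustration only; in practice `pip freeze` does the same job for the whole environment.

```python
from importlib import metadata


def freeze(packages):
    """Return 'name==version' lines for the given installed packages.

    Packages that are not installed in the active environment are
    reported as 'name (not installed)' instead of raising an error.
    """
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            lines.append(f"{name} (not installed)")
    return lines


# "surely-not-a-real-package" is a made-up name to show the fallback.
for line in freeze(["scikit-learn", "bertopic", "surely-not-a-real-package"]):
    print(line)
```

Saving these lines next to the project makes it easier to rebuild the same environment on another machine.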
To grab the correct version of Python to link to in the virtual environment:
if (R.Version()$os == "mingw32") {
  system("where python") # for Windows
} else {
  system("whereis python")
}
I choose the main Python installation in “/usr/bin/python” and use it as the base for a virtual environment. If you don’t have any Python version on your system, you can install one with reticulate::install_miniconda().
# I build in this if condition to not accidentally overwrite the environment when rerunning the notebook
if (!reticulate::virtualenv_exists(envname = "./python-env/")) {
  reticulate::virtualenv_create("./python-env/", python = "/usr/bin/python")
  # for Windows the path is usually "C:/Users/{user}/AppData/Local/r-miniconda/python.exe"
}
reticulate::virtualenv_exists(envname = "./python-env/")
[1] TRUE
reticulate is supposed to automatically pick this up when started, but to make sure, I set the environment variable RETICULATE_PYTHON to the binary of Python in the new environment:
Optional: make this persist across restarts of RStudio by saving the environment variable into an .Renviron file (otherwise the Sys.setenv() line above needs to be in every script):
# open the .Renviron file
usethis::edit_r_environ(scope = "project")
# or directly append it with the necessary line
readr::write_lines(
  x = paste0("RETICULATE_PYTHON=", python_path),
  file = ".Renviron",
  append = TRUE
)
Now reticulate should pick up the correct binary in the project folder:
library(reticulate)
py_config()
python: /home/johannes/Documents/Github/aca_vienna/python-env/bin/python
libpython: /usr/lib/libpython3.11.so
pythonhome: /home/johannes/Documents/Github/aca_vienna/python-env:/home/johannes/Documents/Github/aca_vienna/python-env
version: 3.11.3 (main, Jun 5 2023, 09:32:32) [GCC 13.1.1 20230429]
numpy: /home/johannes/Documents/Github/aca_vienna/python-env/lib/python3.11/site-packages/numpy
numpy_version: 1.24.4
NOTE: Python version was forced by RETICULATE_PYTHON
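The same check can be made from inside Python. The sketch below relies only on the standard library; the function name `in_virtualenv` is an assumption for illustration. It compares `sys.prefix` with `sys.base_prefix`, which differ when the interpreter runs from an environment created with `venv` or a recent version of `virtualenv`.

```python
import sys


def in_virtualenv():
    """True if the running interpreter belongs to a virtual environment.

    Inside a venv, sys.prefix points at the environment folder while
    sys.base_prefix still points at the Python installation it was built from.
    """
    return sys.prefix != sys.base_prefix


print("interpreter:", sys.executable)
print("environment:", sys.prefix)
print("virtual environment active:", in_virtualenv())
```

If the last line prints False in a Python chunk, reticulate has most likely picked up a different interpreter than the one in the project folder.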
Installing Packages
reticulate::py_install() installs packages, similar to install.packages(). Let’s install the packages we need:
reticulate::py_install(c(
  "scikit-learn<1.3.0",
  "bertopic", # this one requires some build tools not usually available on Windows, comment out to install the rest
  "sentence_transformers",
  "simpletransformers"
))
Recreating grafzahl from Python
To demonstrate the workflow for reticulate, we do the same analysis as above, but rely on Python functions.
import pandas as pd
import os
import torch
from simpletransformers.classification import ClassificationModel

# args copied from grafzahl
model_args = {
    "num_train_epochs": 1,  # increase for multiple runs, which can yield better performance
    "use_multiprocessing": False,
    "use_multiprocessing_for_evaluation": False,
    "overwrite_output_dir": True,
    "reprocess_input_data": True,
    "fp16": True,
    "save_steps": -1,
    "save_eval_checkpoints": False,
    "save_model_every_epoch": False,
    "silent": True,
}
os.environ["TOKENIZERS_PARALLELISM"] = "false"
roberta_model = ClassificationModel(
    model_type="roberta",
    model_name="roberta-base",
    # Use GPU if available
    use_cuda=torch.cuda.is_available(),
    args=model_args,
)
We construct a training and test set from the movie review corpus in R:
Now we can train the model on the coded training set and predict the classes for the test set (if you do not have a GPU, this will take a long time, so maybe do it after the course):
# process data to the form simpletransformers needs
train_df = r.imdb_train
train_df['labels'] = train_df['label'].astype('category').cat.codes
train_df = train_df[['text', 'labels']]
roberta_model.train_model(train_df)
# test data needs to be a list
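To make the label conversion less opaque, here is a minimal pure-Python sketch of what `.astype('category').cat.codes` does for plain string labels without missing values: the unique labels are sorted and each label is replaced by its position in that sorted list. The function name `category_codes` and the example labels are assumptions for illustration; the actual strings in the label column may differ.

```python
def category_codes(labels):
    """Map each label to the index of its value among the sorted unique labels.

    This mimics what .astype('category').cat.codes does for plain string
    labels without missing values.
    """
    categories = sorted(set(labels))
    lookup = {category: code for code, category in enumerate(categories)}
    return [lookup[label] for label in labels], categories


# Hypothetical labels; the real column may use different strings.
codes, categories = category_codes(["pos", "neg", "neg", "pos"])
print(categories)  # ['neg', 'pos']
print(codes)       # [1, 0, 0, 1]
```

Keeping the list of categories around is useful later, because the model predicts the integer codes and they need to be mapped back to the original labels.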
.metric     .estimator  .estimate
accuracy    binary      0.8966400
kap         binary      0.7932800
precision   binary      0.9094813
recall      binary      0.8809600
f_meas      binary      0.8949935
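To see how these five metrics relate to each other, the sketch below computes them from the four cells of a binary confusion matrix. The confusion matrix itself is not reported here, so the counts are an assumption: they were chosen to reproduce the figures above for a balanced test set of 6,250 reviews and should be read as illustrative only. The function name `binary_metrics` is likewise hypothetical.

```python
def binary_metrics(tp, fn, fp, tn):
    """Compute accuracy, Cohen's kappa, precision, recall and F1 from counts."""
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_meas = 2 * precision * recall / (precision + recall)
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kap = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "kap": kap,
        "precision": precision,
        "recall": recall,
        "f_meas": f_meas,
    }


# Assumed counts (not taken from the model output), see the note above.
m = binary_metrics(tp=2753, fn=372, fp=274, tn=2851)
for name, value in m.items():
    print(f"{name:<10} {value:.7f}")
```

With balanced classes the chance agreement is 0.5, which is why kappa is simply the accuracy rescaled from the range 0.5 to 1 onto the range 0 to 1.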
Running unsupervised learning with BERTopic
I use the data_corpus_guardian from quanteda.corpora to show an example workflow for BERTopic. This dataset contains Guardian newspaper articles in politics, economy, society and international sections from 2012 to 2016.
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP

# To make this example reproducible
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

# confusingly, this is the setup part
topic_model = BERTopic(
    language="english",
    top_n_words=5,
    n_gram_range=(1, 2),
    nr_topics="auto",  # change if you want a specific nr of topics
    calculate_probabilities=True,
    umap_model=umap_model,
)

# and only here we actually run something
topics, doc_topic = topic_model.fit_transform(r.corp_news.texts)

# save the model
topic_model.save("data/5._bertopic")
# topic_model=BERTopic.load("data/5._bertopic")
Unlike traditional topic models, BERTopic uses an algorithm that automatically determines a sensible number of topics and also automatically labels topics:
Note that -1 describes a trash topic with words and documents that do not really belong anywhere. BERTopic also supplies the top words, i.e., the ones that most likely belong to each topic. In the code above I requested 5 words for each topic:
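To illustrate how the -1 trash topic can be handled, here is a minimal sketch that counts documents per topic and separates out the outliers. The list of topic assignments is made up for illustration and is not real model output; in the workflow above, the `topics` object returned by `fit_transform()` has the same shape (one topic number per document).

```python
from collections import Counter

# Hypothetical topic assignments for ten documents (not real model output).
topics = [0, 1, -1, 0, 2, -1, 0, 1, -1, 0]

counts = Counter(topics)
n_outliers = counts.pop(-1, 0)  # documents in the trash topic
share_outliers = n_outliers / len(topics)

print("outliers:", n_outliers, f"({share_outliers:.0%})")
for topic, n_docs in counts.most_common():
    print("topic", topic, "->", n_docs, "documents")
```

A large share of outliers is worth checking before interpreting the remaining topics, since those documents are effectively left out of the analysis.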
BERTopic also classifies documents into the topic categories (again, not really how you should use LDA topic models) and provides a nice visualisation for trends over time. Unfortunately, the date format in R does not translate automagically to Python, hence we need to convert the dates to strings:
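As a rough sketch of what the Python side can then do with such strings, assuming they arrive in ISO format (YYYY-MM-DD; the exact format and the example values are assumptions), the standard library can parse them back into date objects:

```python
from datetime import date

# Hypothetical ISO-formatted date strings, one per document.
date_strings = ["2016-06-24", "2012-11-07", "2014-09-19"]

dates = [date.fromisoformat(s) for s in date_strings]
years = sorted({d.year for d in dates})
print(years)

# ISO strings also sort chronologically without any parsing.
print(sorted(date_strings) == [d.isoformat() for d in sorted(dates)])
```

ISO-formatted strings are a convenient exchange format here because they are unambiguous and keep their chronological order when sorted as text.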